Text mining extracts visualization

Chapter details

Important phrases

All important phrases extracted by our LPCs have associated weight values indicating how important each phrase is. The phrases are displayed in a tag cloud according to these weights. They are divided into 7 groups, each rendered with a different font size and color: the 'heavier' (more important) phrases are bigger and darker.

We observed that it is common, especially for relatively small documents, for many phrases to have similar weights, for instance 30-40 out of 100. This leads to groups that are either too large or empty, which is not desirable. Therefore, once the selected number of most important phrases has been retrieved, their weights are normalized and a small random distortion is applied to each value to ensure that no large group of phrases shares the same weight.
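The normalization and distortion step can be sketched as follows. This is a minimal illustration, not the widget's actual code: the function name `normalize_and_jitter`, the min-max scaling to [0, 1], and the default distortion of 0.01 are all assumptions, since the text does not specify the normalization method or the size of the distortion.

```python
import random


def normalize_and_jitter(weights, jitter=0.01, seed=None):
    """Scale weights to [0, 1] and add a small random offset to break ties.

    Assumptions (not from the original text): min-max normalization and a
    uniform offset in [-jitter, +jitter].
    """
    rng = random.Random(seed)
    lo, hi = min(weights), max(weights)
    span = hi - lo
    # If every weight is identical, place them all at the midpoint.
    normalized = [(w - lo) / span if span else 0.5 for w in weights]
    # The offset may push values slightly outside [0, 1]; only the
    # relative order and spacing matter for the clustering that follows.
    return [n + rng.uniform(-jitter, jitter) for n in normalized]


if __name__ == "__main__":
    # Three phrases share the weight 5; after the distortion they differ.
    print(normalize_and_jitter([5, 5, 5, 10, 1], seed=42))
```

Keeping the distortion much smaller than the typical gap between distinct weights means that only ties are broken, while the overall ranking of phrases is mostly preserved.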

The retrieved important phrases are then divided into 7 groups by simple k-means clustering. So that the user can distinguish the most important phrases, the initial clustering result is processed further in an attempt to reach the 'perfect' distribution of items among the clusters. Let the clusters be numbered from 0 to 6 in ascending order of phrase weight, and let average be the number of phrases each cluster would hold if all phrases were distributed equally among the clusters. The expected number of items in the i-th cluster is then:

$$\frac{\text{average}}{\log(i+1) + 1}$$

Thus, the cluster holding the heaviest phrases, which are rendered largest, would contain the fewest items, so those items really would be the most important ones.
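The initial clustering and the expected cluster sizes can be sketched as below. This is an illustration under stated assumptions: the helper names `kmeans_1d` and `expected_sizes` are hypothetical, the centroids are seeded at evenly spaced positions of the sorted weights (the text does not say how they are initialized), and `log` is taken to be the natural logarithm.

```python
import math


def kmeans_1d(values, k=7, iterations=50):
    """Plain k-means on scalar weights; clusters come back in ascending order.

    Assumes `values` is non-empty. Seeding is an assumption of this sketch.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Seed centroids at evenly spaced positions of the sorted values.
    centroids = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return clusters


def expected_sizes(total_phrases, groups=7):
    """Expected items per cluster: average / (log(i + 1) + 1)."""
    average = total_phrases / groups
    # Natural logarithm is an assumption; the text only says "log".
    return [average / (math.log(i + 1) + 1) for i in range(groups)]


if __name__ == "__main__":
    for i, size in enumerate(expected_sizes(70)):
        print(f"cluster {i}: expected {size:.2f} phrases")
```

With 70 phrases the average is 10, so cluster 0 is expected to hold 10 phrases while cluster 6, the one rendered largest, is expected to hold only about 3.4.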

This result is achieved by iteratively dividing the cluster that is largest relative to its expected size into smaller clusters. The first of the newly formed 'subclusters', or the first two if possible, is merged with the previous cluster, and the last one is merged with the next cluster, but only if the next cluster is not already oversized. This algorithm ensures that items tend to move to the clusters with smaller values. The operation is performed for a fixed number of iterations, and the best result (the one with the smallest error) is displayed.
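A simplified sketch of this rebalancing loop is shown below. It makes several assumptions that the text leaves open: the oversized cluster is split into three equal slices by rank rather than re-clustered, only the first subcluster (never two) is moved down, and the error is the sum of absolute differences between actual and expected cluster sizes. The names `rebalance` and `distribution_error` are hypothetical.

```python
import math


def expected_sizes(total_phrases, groups):
    """Expected items per cluster: average / (log(i + 1) + 1), natural log assumed."""
    average = total_phrases / groups
    return [average / (math.log(i + 1) + 1) for i in range(groups)]


def distribution_error(clusters, expected):
    """Assumed error measure: total absolute deviation from the expected sizes."""
    return sum(abs(len(c) - e) for c, e in zip(clusters, expected))


def rebalance(clusters, iterations=20, parts=3):
    """Rebalance clusters given in ascending order of weight; return the best state seen."""
    total = sum(len(c) for c in clusters)
    expected = expected_sizes(total, len(clusters))
    current = [sorted(c) for c in clusters]
    best = [list(c) for c in current]
    best_err = distribution_error(current, expected)
    for _ in range(iterations):
        # Pick the cluster that is largest relative to its expected size.
        i = max(range(len(current)), key=lambda j: len(current[j]) / expected[j])
        items = current[i]
        if len(items) < parts:
            break  # too small to split any further
        step = len(items) // parts
        low = items[:step]
        mid = items[step:len(items) - step]
        high = items[len(items) - step:]
        if i > 0:
            current[i - 1] = current[i - 1] + low  # lightest slice moves down
        else:
            mid = low + mid  # no previous cluster exists
        if i + 1 < len(current) and len(current[i + 1]) < expected[i + 1]:
            current[i + 1] = high + current[i + 1]  # next cluster is not oversized
        else:
            mid = mid + high
        current[i] = mid
        err = distribution_error(current, expected)
        if err < best_err:
            best, best_err = [list(c) for c in current], err
    return best


if __name__ == "__main__":
    start = [[1], [2], [3], [4], [5], [6], list(range(7, 22))]
    print([len(c) for c in rebalance(start)])
```

Because the heaviest slice moves up only when the next cluster is below its expected size, while the lightest slice always moves down when a previous cluster exists, items drift toward the lower-weight clusters, which matches the behavior described above.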